Add support for TransformerEngine flash attention in WAN #299
base: main
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
@cpersson-amd I've been out on PTO for a month. I'll take a closer look at this next week. Meanwhile, can you update your branch with the latest in main? Thanks.
entrpn left a comment:
In general the PR looks good, but I'm still unsure if adding another axis, fsdp_batch, is really necessary. I would prefer not to add it. The other major thing is switching the mesh_axes from data, fsdp, tensor to data, tensor, fsdp.
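For context, a minimal sketch of the mesh-axis ordering being discussed, written in plain JAX. The device reshape and sharding specs below are illustrative assumptions, not the actual MaxDiffusion configuration:

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Arrange the available devices along three named axes in the proposed order:
# ("data", "tensor", "fsdp"). The reshape here is only an example layout.
devices = np.array(jax.devices()).reshape(1, 1, -1)
mesh = Mesh(devices, axis_names=("data", "tensor", "fsdp"))

# Example sharding: the batch dimension of activations is split across both
# the "data" and "fsdp" axes, while the remaining dimensions are replicated.
activation_sharding = NamedSharding(mesh, PartitionSpec(("data", "fsdp"), None, None))
```

Reordering the axis names changes which physical device dimension each logical axis maps to, which is why the ordering matters for how existing sharding rules resolve.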
@susanbao can you take a quick look at this PR?
@cpersson-amd please review @susanbao's comments above and rebase with main. We tested the PR internally and it looks good. Would you be willing to change the axis fsdp to context? If not, I can make the change after this PR is merged.
Thanks @cpersson-amd, this looks great. Can you run 'ruff check --fix'?
@entrpn Sure, I ran 'ruff check --fix' and had to manually fix some bare except statements. It should be good with the latest commit.
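The bare-except fix mentioned here is the kind of change ruff's E722 rule requires. A small illustrative snippet (not code from this PR, assuming an optional-import pattern):

```python
# Before: a bare `except:` also swallows KeyboardInterrupt and SystemExit.
# After: catch the specific exception that can actually occur.
try:
    import transformer_engine  # noqa: F401
    HAS_TRANSFORMER_ENGINE = True
except ImportError:  # was: `except:`
    HAS_TRANSFORMER_ENGINE = False
```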
@cpersson-amd Please review my PR to fix some of the unit tests. Once they pass, this can be merged. cpersson-amd#1
This PR adds support for TransformerEngine flash attention in WAN. The code has been tested on WAN 2.1 (training and inference) and Flux (training only) using GPUs.
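A minimal sketch of how an alternate attention backend can be gated behind a flag, with plain-JAX scaled dot-product attention as the fallback. `te_flash_attention` and its module path are hypothetical placeholders, not the actual wrapper used in this PR:

```python
import jax.numpy as jnp
from jax.nn import softmax


def reference_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    scale = q.shape[-1] ** -0.5
    logits = jnp.einsum("bhqd,bhkd->bhqk", q, k) * scale
    return jnp.einsum("bhqk,bhkd->bhqd", softmax(logits, axis=-1), v)


def attention(q, k, v, use_te_flash_attention: bool = False):
    if use_te_flash_attention:
        # Hypothetical wrapper around the TransformerEngine attention kernel;
        # the PR wires the real backend into the model's attention layer.
        from te_attention_wrapper import te_flash_attention  # hypothetical module
        return te_flash_attention(q, k, v)
    return reference_attention(q, k, v)
```

Keeping a reference path alongside the fused kernel makes it easy to verify numerics and to run on hardware where the fused kernel is unavailable.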